Game Rating Exploration by Rakpong Kittinaradorn

Date: 19 November 2016

This report explores game ratings from IGN. A summary and the structure of the dataset are as follows.

##        X           score_phrase 
##  Min.   :    0   Great   :4772  
##  1st Qu.: 4657   Good    :4741  
##  Median : 9312   Okay    :2945  
##  Mean   : 9312   Mediocre:1959  
##  3rd Qu.:13968   Amazing :1804  
##  Max.   :18624   Bad     :1269  
##                  (Other) :1134  
##                                      title      
##  Cars                                   :   10  
##  Madden NFL 07                          :   10  
##  Open Season                            :   10  
##  Brain Challenge                        :    9  
##  LEGO Star Wars II: The Original Trilogy:    9  
##  Madden NFL 08                          :    9  
##  (Other)                                :18567  
##                                             url       
##  /games/aladdin/gba-566703                    :    2  
##  /games/big-league-sports/wii-14275098        :    2  
##  /games/blur/xbox-360-14222096                :    2  
##  /games/call-of-duty-modern-warfare-2/ps3-2550:    2  
##  /games/crash-twinsanity/ps2-667247           :    2  
##  /games/defiance/pc-71832                     :    2  
##  (Other)                                      :18612  
##           platform        score             genre      editors_choice
##  PC           :3370   Min.   : 0.50   Action   :3797   N:15107       
##  PlayStation 2:1686   1st Qu.: 6.00   Sports   :1916   Y: 3517       
##  Xbox 360     :1630   Median : 7.30   Shooter  :1610                 
##  Wii          :1366   Mean   : 6.95   Racing   :1228                 
##  PlayStation 3:1356   3rd Qu.: 8.20   Adventure:1174                 
##  Nintendo DS  :1045   Max.   :10.00   Strategy :1071                 
##  (Other)      :8171                   (Other)  :7828                 
##   release_year  release_month     release_day  
##  Min.   :1996   Min.   : 1.000   Min.   : 1.0  
##  1st Qu.:2003   1st Qu.: 4.000   1st Qu.: 8.0  
##  Median :2007   Median : 8.000   Median :16.0  
##  Mean   :2007   Mean   : 7.139   Mean   :15.6  
##  3rd Qu.:2010   3rd Qu.:10.000   3rd Qu.:23.0  
##  Max.   :2016   Max.   :12.000   Max.   :31.0  
## 
## 'data.frame':    18624 obs. of  11 variables:
##  $ X             : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ score_phrase  : Factor w/ 11 levels "Amazing","Awful",..: 1 1 6 6 6 5 2 1 2 5 ...
##  $ title         : Factor w/ 12589 levels ".deTuned",".hack//G.U. Vol. 1: Rebirth",..: 5702 5703 9767 7249 7249 11405 2908 4446 2908 11405 ...
##  $ url           : Factor w/ 18577 levels "/games/0-d-beat-drop/xbox-360-14342395",..: 8390 8387 14319 10813 10812 16931 4271 6526 4270 16932 ...
##  $ platform      : Factor w/ 59 levels "Android","Arcade",..: 39 39 15 58 36 20 58 33 36 33 ...
##  $ score         : num  9 9 8.5 8.5 8.5 7 3 9 3 7 ...
##  $ genre         : Factor w/ 113 levels "","Action","Action, Adventure",..: 65 65 70 95 95 106 39 83 39 106 ...
##  $ editors_choice: Factor w/ 2 levels "N","Y": 2 2 1 1 1 1 1 2 1 1 ...
##  $ release_year  : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ release_month : int  9 9 9 9 9 9 9 9 9 9 ...
##  $ release_day   : int  12 12 12 11 11 11 11 11 11 11 ...
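
Output of this kind comes from summary() and str(). The following is a minimal sketch of how it could be produced, not necessarily the code used for this report. The file name ign.csv in the comment is an assumption, and a tiny made-up inline sample is used so that the snippet runs on its own.

```r
# Tiny inline sample standing in for the real file (values are made up).
csv_text <- "X,score_phrase,title,score,editors_choice
0,Amazing,Game A,9.0,Y
1,Good,Game B,7.0,N
2,Bad,Game C,3.0,N"

# With the real data this would be something like
# read.csv("ign.csv", stringsAsFactors = TRUE)  (file name is an assumption).
game <- read.csv(text = csv_text, stringsAsFactors = TRUE)

summary(game)  # per-column summary statistics and level counts
str(game)      # number of observations, column types and first values
```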

Univariate Plots Section

In this section, I will plot many histograms to see the distribution of each feature.

The distribution of score_phrase is slightly left-skewed. Next I will plot the distribution of score, which I expect to be similar.

The score and score_phrase distributions are similar, as expected.

Most of the games are not picked as the editor’s choice.

The 2 plots above show the score distribution of games with and without editor’s choice.

The total number of games per year increases from 1997 to 2008 and declines after that.

The plot shows the market share of the 10 most popular platforms. PC is the winner. It has approximately the same share as all PlayStation platforms combined.

Action is the most popular genre of all time.

Univariate Analysis

What is the structure of your dataset?

There are 18624 entries in this dataset with 11 features (X, title, url, score_phrase, score, platform, genre, editors_choice, release_year, release_month, release_day). X is just the index, while title and url are specific to each game. I will not include these three features in the analysis. The features release_year, release_month and release_day can be combined into a single feature called release_date. There is one factor feature that I ordered myself, namely score_phrase. The levels are as follows.

Disaster < Unbearable < Painful < Awful < Bad < Mediocre < Okay < Good < Great < Amazing < Masterpiece
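
A minimal sketch of how such an ordering can be set in R, using the level names listed above. The toy data frame is made up for illustration and the report's actual code may differ.

```r
# Levels from worst to best, as listed above.
phrase_levels <- c("Disaster", "Unbearable", "Painful", "Awful", "Bad",
                   "Mediocre", "Okay", "Good", "Great", "Amazing",
                   "Masterpiece")

# Toy data frame standing in for the real dataset (made-up rows).
game <- data.frame(score_phrase = c("Great", "Bad", "Amazing", "Okay"))

# Convert to an ordered factor so that plots and comparisons follow
# the quality ranking instead of alphabetical order.
game$score_phrase <- factor(game$score_phrase,
                            levels = phrase_levels, ordered = TRUE)

# Ordered factors support comparisons against a level name.
at_least_good <- game$score_phrase >= "Good"
```

Without ordered = TRUE, R would sort the levels alphabetically, which is why "Amazing" and "Awful" appear first in the str() output.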

What is/are the main feature(s) of interest in your dataset?

I am interested in score, genre and platform. I would like to examine which platform a gamer should buy in order to play a lot of high quality games.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Release_date will support the investigation by showing how games and platforms develop over time, while editors_choice will help me filter high quality games.

Did you create any new variables from existing variables in the dataset?

Yes, I created release_date by combining release_year, release_month and release_day.

# Approximate the release date as a decimal year
# (every month is treated as 31 days long).
game$release_date <- game$release_year + 
  (game$release_month-1)/12 + 
  (game$release_day-1)/(12*31)
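
As a small worked example, the formula can be checked on a single row. The sketch below uses the first row from the str() output (12 September 2012). The extra column release_cal is hypothetical and not part of the report. It only shows that a true calendar date could be built as an alternative.

```r
# One example row, taken from the str() output: 12 September 2012.
game <- data.frame(release_year = 2012L,
                   release_month = 9L,
                   release_day = 12L)

# Decimal-year version, same formula as in the report.
game$release_date <- game$release_year +
  (game$release_month - 1) / 12 +
  (game$release_day - 1) / (12 * 31)

# Alternative (hypothetical column, not used in the report):
# a real calendar date built from the same three columns.
game$release_cal <- as.Date(ISOdate(game$release_year,
                                    game$release_month,
                                    game$release_day))

print(game)
```

The decimal year is convenient as a continuous x-axis, while a Date column keeps exact month lengths.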

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Most of the distributions are only slightly skewed, so no transformation is required here.

Bivariate Plots Section

This plot shows the count of games for each popular platform. It is hard to make a comparison between years because the total number of games goes up and down throughout the years. In the next plot I will make the y-axis the percentage of games instead of the count to make the comparison easier.
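
The conversion from counts to percentages per year can be sketched as follows. This is an illustration on made-up rows, assuming the column names release_year and platform from the dataset. The report's own plotting code is not shown and may differ.

```r
# Toy data standing in for the real dataset (made-up rows).
game <- data.frame(
  release_year = c(2007, 2007, 2007, 2007, 2008, 2008),
  platform     = c("PC", "PC", "Wii", "Xbox 360", "PC", "Wii")
)

# Count of games per year (rows) and platform (columns).
counts <- table(game$release_year, game$platform)

# Percentage of each year's games per platform: every row sums to 100.
share <- 100 * prop.table(counts, margin = 1)

print(share)
```

Normalising within each year removes the effect of the changing total, so that only the relative weight of each platform remains.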

We can now see that PC is the most consistent platform in terms of number of games.

The red line is the average score. It tends to increase over time.

Release day is not a contributing factor to score.

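The solid line is the median and the dashed lines are the first and ninth deciles (10% and 90% quantiles). The variation in median score across the 12 months is small, roughly half a point (the monthly medians range from 7.0 to 7.6).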

## # A tibble: 12 × 2
##    release_month score_median
##            <int>        <dbl>
## 1              1          7.0
## 2              2          7.5
## 3              3          7.4
## 4              4          7.3
## 5              5          7.1
## 6              6          7.2
## 7              7          7.1
## 8              8          7.5
## 9              9          7.6
## 10            10          7.5
## 11            11          7.3
## 12            12          7.0
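
A table of this shape can be computed with a grouped median. The tibble output suggests that dplyr was used for the report. The sketch below uses base R instead so that it runs without extra packages, and the rows are made up.

```r
# Toy data standing in for the real dataset (made-up rows).
game <- data.frame(
  release_month = c(1, 1, 1, 9, 9, 9),
  score         = c(6.5, 7.0, 8.0, 7.0, 7.6, 9.0)
)

# Median score for each release month.
score_by_month <- aggregate(score ~ release_month, data = game, FUN = median)
names(score_by_month)[2] <- "score_median"

# 10% and 90% quantiles per month, e.g. for dashed lines around the median.
score_q10 <- tapply(game$score, game$release_month, quantile, probs = 0.1)
score_q90 <- tapply(game$score, game$release_month, quantile, probs = 0.9)

print(score_by_month)
```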

The count is again hard to compare because the total number of games changes. I therefore create another plot with percentages.

Action has a market share of around 20% throughout the years. Other genres rise and fall alternately. The following table shows the top-scoring genres with more than 100 games.

## # A tibble: 6 × 3
##               genre genre_median_score number_of_game
##              <fctr>              <dbl>          <int>
## 1               RPG                7.9            980
## 2 Action, Adventure                7.7            765
## 3       Action, RPG                7.7            330
## 4          Fighting                7.5            547
## 5        Platformer                7.5            823
## 6            Puzzle                7.5            776
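
The ranking can be sketched as: compute the median score and the number of games per genre, keep the genres above a size threshold, then sort by median. The example below is a base R illustration on made-up rows. The variable min_games is a hypothetical name, set to 2 here, whereas the report uses 100.

```r
# Toy data standing in for the real dataset (made-up rows).
game <- data.frame(
  genre = c("RPG", "RPG", "RPG", "Puzzle", "Puzzle", "Puzzle", "Racing"),
  score = c(8.0, 7.9, 7.0, 7.5, 7.0, 8.0, 9.5)
)

min_games <- 2  # hypothetical threshold; the report uses 100

# Median score and number of games for each genre.
med <- tapply(game$score, game$genre, median)
n   <- tapply(game$score, game$genre, length)

genre_stats <- data.frame(
  genre              = names(med),
  genre_median_score = as.vector(med),
  number_of_game     = as.vector(n)
)

# Keep genres with enough games, then sort by median score, best first.
genre_stats <- genre_stats[genre_stats$number_of_game > min_games, ]
genre_stats <- genre_stats[order(-genre_stats$genre_median_score), ]

print(genre_stats)
```

The size filter matters because a genre with only a handful of games (such as the single "Racing" row in the toy data) can otherwise top the ranking by chance.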

The box plot shows a summary of the score distribution of the popular genres. They do not differ much from each other.

The number of platforms tends to increase over time. Next I will investigate the behavior of the editors over time.

The editors tend to pick more games in recent years.

High-scoring games can be both in and out of the editor’s choice.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Some platforms have a good number of games only in recent years (e.g. iPhone) and some were more popular in the old days, but PC and the PlayStation series have a consistent number of games throughout the period of interest. Game score does not depend on release day, but it tends to increase slightly with year. By median score, the best month to release a game is September (7.6). The worst are January and December (7.0).

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The percentage of games picked as editors_choice tends to increase with time. This is related to the uptrend in score with time. The number of gaming platforms is increasing. Some of the top-scoring games (score > 8) are not picked as editors_choice, which suggests that the editors must have other criteria for picking their choice.
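
Both observations can be checked with a few lines of R. The sketch below is an illustration on made-up rows, assuming the column names from the dataset, and is not necessarily how the plots in this report were produced.

```r
# Toy data standing in for the real dataset (made-up rows).
game <- data.frame(
  release_year   = c(2000, 2000, 2000, 2000, 2010, 2010),
  score          = c(9.0, 8.5, 6.0, 5.0, 9.0, 8.2),
  editors_choice = c("Y", "N", "N", "N", "Y", "Y")
)

# Percentage of games per year that were picked as editor's choice.
pct_choice <- 100 * tapply(game$editors_choice == "Y",
                           game$release_year, mean)

# Top-scoring games (score > 8) that were not picked.
high_not_chosen <- subset(game, score > 8 & editors_choice == "N")

print(pct_choice)
print(high_not_chosen)
```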

What was the strongest relationship you found?

Overall, gaming standards are inflating, i.e. higher scores, more platforms and more editor’s choice games.

Multivariate Plots Section

This plot is quite hard to interpret. The quality of games for each platform fluctuates over time.

This shows the continuity of the PlayStation series.

Some genres, such as RPG, evolve over time in terms of quality.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Most of the mean scores in the popular genre groups are consistent, except “Action, Adventure”, which is in decline from 1995 to 2007, and “RPG”, which was rising rapidly in the period around 1995 to 1998.

Were there any interesting or surprising interactions between features?

Among the PlayStation series, after a new version is released (i.e. PlayStation 2, 3, 4), games for the old version (i.e. PlayStation 1, 2, 3) usually perform better!


Final Plots and Summary

Plot One

Description One

The total number of games was rising from 1997 to 2008 and falling after 2008. This plot gives an overall view of the gaming industry throughout its history.

Plot Two

Description Two

PC and the PlayStation series are the most consistent platforms in terms of number of games. If gamers want to have many gaming options available, PC and PlayStation are their choice.

Plot Three

Description Three

Throughout the years, the mean scores of most genres are quite constant. The exceptions are the genres related to RPG, which are on the rise, and those involving action, which are in decline.


Reflection

This dataset is about game ratings from ign.com, a famous game website. It involves over 18000 games from 1996 to 2016. It spans most of the gaming platforms and game genres available in this period. I start exploring this dataset by plotting the frequency of each variable. By doing this, I get an overall understanding of the dataset. The trend of the gaming industry over this period becomes clear: the number of games peaked in 2008 and has been declining since then. Next I compare two different variables by means of scatter plots, line plots, stacked bar plots and box plots. The evolution of game score, platform and genre over this period is investigated. Lastly I plot multivariate graphs to examine three variables simultaneously. By exploring this dataset, trends in genre can also be seen. I can also see which platforms are transient and which stand the test of time.

Univariate and bivariate plots are straightforward to generate and explore, but it is quite hard to create meaningful multivariate plots. In the future, using more complicated plots such as heat maps should give additional insights to the analysis.
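
As a rough sketch of the heat map idea, the mean score can be arranged in a year-by-genre matrix and drawn with base R graphics. The rows below are made up and the choice of image() is only one possible approach.

```r
# Toy data standing in for the real dataset (made-up rows).
game <- data.frame(
  release_year = c(2006, 2006, 2006, 2007, 2007, 2007),
  genre        = c("Action", "Action", "RPG", "Action", "RPG", "RPG"),
  score        = c(7.0, 8.0, 8.5, 6.0, 9.0, 8.0)
)

# Mean score for every combination of year (rows) and genre (columns).
mean_score <- tapply(game$score,
                     list(game$release_year, game$genre), mean)

# Draw the matrix as a heat map: x = year, y = genre, colour = mean score.
image(x = as.numeric(rownames(mean_score)),
      y = seq_len(ncol(mean_score)),
      z = mean_score,
      xlab = "release_year", ylab = "", yaxt = "n")
axis(2, at = seq_len(ncol(mean_score)), labels = colnames(mean_score))
```

Each cell then shows three variables at once (year, genre and mean score), which is the kind of view that was hard to obtain with line plots.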